Abstract.

We discuss the development of a Java toolbox for astronomical time series
data. Rather than using methods conventional in astronomy (e.g., power
spectrum and cross-correlation analysis), we employ rule discovery techniques
commonly used in analyzing stock-market data. By clustering patterns found
within the data, rule discovery allows one to build predictive models that
forecast when a given event might occur or whether the occurrence of one event
will trigger a second. We have tested the toolbox and accompanying display
tool on datasets (representing several classes of objects) from the RXTE All
Sky Monitor. We use these datasets to illustrate the methods and functionality
of the toolbox. We also discuss issues that can come up in data analysis as
well as the possible future development of the package.

1. Introduction


Many types of variable objects exist in the universe, including stars with
predictable behavior (e.g., Cepheids), objects with behavior that is
inherently unpredictable (e.g., AGN), and objects with both predictable and
irregular variability patterns (e.g., X-ray binaries). Constant monitoring of
variable objects has been a continuing interest in astronomy, beginning with
the 16th century astronomer David Fabricius and extending through history to
Herschel, Leavitt and others. Today, monitoring is done with a wide variety of
techniques, observers and instruments, from dedicated amateurs, to
professional astronomers interested in intensive monitoring of individual
objects, to all-sky monitors such as the RXTE ASM and BATSE aboard CGRO.

While they are a new tool, all-sky monitors have already made important, if
not decisive, contributions to solving some of astronomy's most persistent
mysteries, such as establishing the cosmological origin of gamma-ray bursts
and linking emission regions in AGN. With major initiatives such as the
Large-Area Synoptic Survey Telescope (LSST) and Supernova Acceleration Probe
(SNAP), all-sky monitors are poised to become a major discovery tool in
astronomy. To maximize the utility of large monitoring programs, it is
important to devise ways of handling large amounts of data in real time and to
find not only variability but also predictive patterns among these large data
streams. It was with these goals in mind that we undertook this project.
Raw Time Series

Handle missing data by interpolating values. Other smoothing techniques may
be applied to de-noise the signal. DWT techniques may be applied to smooth the
curve and interpolate the missing data. Normalization should be done in order
to avoid biasing the rules.

Find the basic shapes that occur in the signal. A windowing technique is used
to find the most frequently occurring patterns.

Cluster the occurrences of the basic shapes discovered in the discretization
step. Also evaluate the cluster representative for all the occurrences in any
given cluster. We use simple Euclidean distance as a measure of similarity.
Minimum cluster distance is a parameter that can be varied by the user.

Rule discovery deals with evaluating the rules "A -> B in T days". This is
done by using the information from the clustered occurrences of the basic
patterns. Each rule has a frequency and a confidence value associated with it.

Rule filtering basically means selecting those rules that can be generalized.
We do this by keeping all those rules that have a high frequency and
confidence value associated with them.

Figure 1. A flow chart describing the basic rule discovery algorithm 
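
The preprocessing stage of the flow chart (interpolation of missing data and
normalization) could be sketched roughly as follows. This is only an
illustration under stated assumptions: missing samples are marked as NaN, gaps
are filled by linear interpolation, and normalization is a z-score. The class
and method names are hypothetical and are not taken from the toolbox.

```java
import java.util.Arrays;

/** Illustrative preprocessing sketch; names and choices are assumptions. */
final class PreprocessSketch {

    /**
     * Fills NaN gaps by linear interpolation between the nearest valid
     * neighbours. Leading and trailing gaps copy the nearest valid value.
     * If every sample is NaN, the input is returned unchanged (as a copy).
     */
    static double[] interpolateMissing(double[] x) {
        double[] y = Arrays.copyOf(x, x.length);
        int prev = -1; // index of the last valid sample seen so far
        for (int i = 0; i < y.length; i++) {
            if (Double.isNaN(y[i])) continue;
            if (prev < 0) {
                // Leading gap: copy the first valid value backwards.
                for (int j = 0; j < i; j++) y[j] = y[i];
            } else if (i - prev > 1) {
                // Interior gap: straight line between the two valid samples.
                double step = (y[i] - y[prev]) / (i - prev);
                for (int j = prev + 1; j < i; j++) y[j] = y[prev] + step * (j - prev);
            }
            prev = i;
        }
        // Trailing gap: copy the last valid value forwards.
        if (prev >= 0) for (int j = prev + 1; j < y.length; j++) y[j] = y[prev];
        return y;
    }

    /** Rescales the series to zero mean and unit standard deviation. */
    static double[] normalize(double[] x) {
        double mean = 0;
        for (double v : x) mean += v;
        mean /= x.length;
        double var = 0;
        for (double v : x) var += (v - mean) * (v - mean);
        double sd = Math.sqrt(var / x.length);
        double[] y = new double[x.length];
        // A constant series has sd == 0; map it to all zeros.
        for (int i = 0; i < x.length; i++) y[i] = sd == 0 ? 0 : (x[i] - mean) / sd;
        return y;
    }
}
```

A light curve would be passed through interpolateMissing first and normalize
second, so that the gap markers do not contaminate the mean and standard
deviation.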



2. Clustering and Rule Discovery 

The need to understand time series data is not unique to astronomy: time series 
data exists in a wide variety of fields including geology and atmospheric science 
as well as many business applications. The common problem is how to efficiently 
find patterns in the data. Fourier transform and power spectrum methods, 
commonly used in astronomy research, are well suited for finding patterns with 
well-determined periodicities (e.g., see Scargle 1997). However, they may be 
less helpful for objects with irregular behavior, and are not optimized to serve 
as predictive tools. Our approach was to employ clustering and rule discovery, 
which are optimized for finding patterns that do not rely on a regular periodicity. 

We first attempt to represent the dataset as a collection of patterns, by
sliding a window of size w through the time series to get subsequences. We
then cluster these subsequences using the greedy clustering algorithm
explained in Das, Gunopulos & Mannila (1997). Once a good fit is achieved,
these clusters can then be considered as the basic shapes of the time series,
with the entire series composed of superpositions and/or combinations of these
basic shapes. The next and main step of the process is to find interesting
probabilistic rules between the clusters in the two time series. These rules
are of the form: "If a cluster A occurs in time series 1 then we can say with
confidence c that B will occur in time series 2 within time T". Figure 1
displays a flow chart of the basic algorithm used in predictive time series
analysis.
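
The windowing, clustering and rule-evaluation steps described above might look
roughly like the following sketch. It is not the toolbox's implementation: the
greedy scheme shown (a window joins the first cluster whose representative
lies within a distance threshold, otherwise it founds a new cluster) is a
simple assumed variant, the threshold is assumed to play the role of the
user-set cluster distance, and all class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch of windowing, greedy clustering and rule scoring. */
final class RuleDiscoverySketch {

    /** All contiguous subsequences of length w, sliding one sample at a time. */
    static double[][] windows(double[] x, int w) {
        int n = Math.max(0, x.length - w + 1);
        double[][] out = new double[n][];
        for (int i = 0; i < n; i++) out[i] = Arrays.copyOfRange(x, i, i + w);
        return out;
    }

    /** Euclidean distance between two windows of equal length. */
    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    /**
     * Assigns each window to the first cluster whose founding window is
     * within clusterDist; otherwise the window founds a new cluster.
     * Founding windows are appended to centers. Returns one label per window.
     */
    static int[] greedyCluster(double[][] wins, double clusterDist, List<double[]> centers) {
        int[] labels = new int[wins.length];
        for (int i = 0; i < wins.length; i++) {
            int label = -1;
            for (int c = 0; c < centers.size(); c++) {
                if (distance(wins[i], centers.get(c)) <= clusterDist) { label = c; break; }
            }
            if (label < 0) { centers.add(wins[i]); label = centers.size() - 1; }
            labels[i] = label;
        }
        return labels;
    }

    /**
     * Scores the rule "a in series 1 is followed by b in series 2 within T
     * steps". Returns {frequency, confidence}: the number of occurrences of
     * a that are followed by b, and that number divided by the count of a.
     */
    static double[] rule(int[] s1, int[] s2, int a, int b, int T) {
        int countA = 0, hits = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] != a) continue;
            countA++;
            for (int j = i + 1; j <= i + T && j < s2.length; j++) {
                if (s2[j] == b) { hits++; break; }
            }
        }
        return new double[] { hits, countA == 0 ? 0.0 : (double) hits / countA };
    }
}
```

For a single light curve, the label sequence returned by greedyCluster can be
passed as both s1 and s2; rule filtering then amounts to keeping only rules
whose frequency and confidence exceed user-chosen thresholds.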



3. Data Analysis Structure and Usage 

Two goals are possible with this approach, each requiring a slightly different
analysis tree: (1) to schedule a telescope more intelligently, so that objects
are observed when they are at their brightest or most interesting stage, or
(2) to serve as an analysis tool for finding patterns within a (possibly
multivariate) time series (see Figure 2).

Figure 2. Flow charts for the possible use of this algorithm in
scheduling a telescope.

Within the time-series analysis (TSA) module, the user can control three main
parameters: the window size, the minimum cluster distance and the time period
for prediction. The TSA module analyzes the time series data and performs
trend prediction and rule discovery on both real-time and historical data.
This output can be used either directly for data analysis in a paper or to
forecast both the activity level of an object and the ability of the telescope
or satellite to perform the desired observations. All of these factors could
be considered in the light of the constraints imposed by individual
investigators as well as the review committee. The goal of such a process
would be to maximize the scientific usefulness of all the observations. If,
for example, a given observation required that a source be above a certain
flux level, such a program might have a better chance of ensuring that this
was so during the observations.

4. Results, Limitations and Future Work 

Figure 3. A screenshot of the Java GUI application that is used to
display the rules generated. Each color represents an occurrence of
a basic pattern. As each rule is clicked on, the GUI shows the rule's
frequency and associated confidence values.

To test and validate the toolbox, we used RXTE ASM datasets for various
objects, spanning several different object classes. In Figure 3, we show one
particularly interesting dataset, from SMC X-1. SMC X-1 shows interesting
periodic behavior on several different timescales: a long time-period behavior
as well as short time-scale variations superposed upon the long-period
behavior, but occurring only when the object is bright (see, e.g., Kahabka &
Li 1999). The toolkit was found to be particularly effective for predicting
short-period rules. Longer-period rules can be found by varying the window
size, although possibly at the expense of losing short time-scale rules as the
window size becomes larger than the timescale of the rules in question. We can
also find longer-period rules with smoothing techniques.
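
As one possible illustration of the smoothing mentioned above, a centered
moving average suppresses short time-scale variations so that longer-period
shapes dominate the windows. The text does not say which smoothing technique
the toolkit uses, so the boxcar filter below is an assumption, and the class
and method names are hypothetical.

```java
/** Illustrative smoothing sketch; the boxcar filter is an assumed choice. */
final class SmoothingSketch {

    /**
     * Centered moving average with half-width h (full width 2h + 1).
     * Near the edges the average is taken over the samples that exist.
     */
    static double[] movingAverage(double[] x, int h) {
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            int lo = Math.max(0, i - h);
            int hi = Math.min(x.length - 1, i + h);
            double sum = 0;
            for (int j = lo; j <= hi; j++) sum += x[j];
            y[i] = sum / (hi - lo + 1);
        }
        return y;
    }
}
```

A larger half-width removes more of the short time-scale structure, which
mirrors the trade-off described for the window size.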

The current implementation of the toolkit consists of a rule inference engine,
programmed in C, and a visualizer, programmed in Java. The current toolkit has
several limitations. First, it can analyze only a single time series at a
time. This obviously limits its functionality, as many objects are studied
with multiple instruments. Second, it has limited ability to deal with missing
data. In addition, one could envision multiple processing options, including
the incorporation of dynamic time warping techniques, which could be
particularly useful for spotting quasi-periodic oscillations.
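
To make the dynamic time warping idea concrete, the sketch below computes the
classic DTW distance between two sequences with an absolute-difference local
cost. It is a textbook formulation offered for illustration only; it is not
part of the current toolkit, and the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Illustrative dynamic time warping sketch (not part of the toolkit). */
final class DtwSketch {

    /**
     * DTW distance between a and b. d[i][j] holds the cheapest cost of
     * aligning the first i samples of a with the first j samples of b.
     */
    static double dtw(double[] a, double[] b) {
        int n = a.length, m = b.length;
        double[][] d = new double[n + 1][m + 1];
        for (double[] row : d) Arrays.fill(row, Double.POSITIVE_INFINITY);
        d[0][0] = 0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double cost = Math.abs(a[i - 1] - b[j - 1]);
                // Extend the cheapest of: match, stretch a, or stretch b.
                d[i][j] = cost + Math.min(d[i - 1][j - 1],
                                 Math.min(d[i - 1][j], d[i][j - 1]));
            }
        }
        return d[n][m];
    }
}
```

Because DTW lets one sequence be stretched locally against the other, two
occurrences of the same shape with slightly different durations have a small
distance, whereas the Euclidean distance would penalize the misalignment. Used
in place of the Euclidean distance during clustering, this could in principle
group cycles whose period drifts.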